The influence of negative training set size on machine learning-based virtual screening
Abstract
BACKGROUND The paper presents a thorough analysis of how the number of negative training examples influences the performance of machine learning methods.

RESULTS The impact of this rather neglected aspect of applying machine learning methods was examined for sets containing a fixed number of positive examples and a varying number of negative examples randomly selected from the ZINC database. Increasing the number of negative training instances relative to positive ones was found to greatly influence most of the investigated evaluation parameters of ML methods in simulated virtual screening experiments. In a majority of cases, substantial increases in precision and the Matthews correlation coefficient (MCC) were observed, accompanied by some decrease in hit recall. Analysis of the dynamics of those variations allowed us to recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naïve Bayes, IBk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with the SMO or Random Forest algorithms. The Naïve Bayes models appeared to be largely insensitive to changes in the number of negative instances in the training set.

CONCLUSIONS The ratio of positive to negative training instances should be taken into account when preparing machine learning experiments, as it might significantly influence the performance of a particular classifier. Moreover, optimizing the negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
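The evaluation parameters named in the abstract (precision, hit recall and MCC) have standard definitions in terms of confusion-matrix counts. The following is a minimal Python sketch of those definitions, not the authors' evaluation code; the function name `screening_metrics` and the example counts are hypothetical and were invented only to show the arithmetic.

```python
import math


def screening_metrics(tp, fp, tn, fn):
    """Return (precision, recall, mcc) from confusion-matrix counts.

    tp/fn: active compounds predicted active/inactive.
    fp/tn: inactive compounds predicted active/inactive.
    """
    # Precision: fraction of predicted actives that are truly active.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of true actives that the model retrieves.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


# Hypothetical counts for illustration only (not results from the paper).
precision, recall, mcc = screening_metrics(tp=80, fp=20, tn=880, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} MCC={mcc:.3f}")
```

Because MCC uses all four counts, it remains informative when negatives greatly outnumber positives, which is the usual situation in a virtual screening library.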
Similar resources
The influence of the negative-positive ratio and screening database size on the performance of machine learning-based virtual screening
The machine learning-based virtual screening of molecular databases is a commonly used approach to identify hits. However, many aspects associated with training predictive models can influence the final performance and, consequently, the number of hits found. Thus, we performed a systematic study of the simultaneous influence of the proportion of negatives to positives in the testing set, the s...
Evaluation of different machine learning methods for ligand-based virtual screening
In silico high-throughput screening of large compound databases has become an increasingly popular technology for finding valuable drug candidates by applying a wide range of computational methods, such as machine learning [1]. In recent years, many comparative studies of the performance of different machine learning methods in ligand-based virtual screening have been reported [2,3]. In order to extend th...
Pharmacophore Based Virtual Screening Approach to Identify Selective PDE4B Inhibitors
Phosphodiesterase 4 (PDE4) has been established as a promising target in asthma and chronic obstructive pulmonary disease. PDE4B subtype selective inhibitors are known to reduce the dose limiting adverse effect associated with non-selective PDE4B inhibitors. This makes the development of PDE4B subtype selective inhibitors a desirable research goal. To achieve this goal, ligand based pharmacophore m...
The influence of the inactives subset generation on the performance of machine learning methods
BACKGROUND The application of machine learning methods in virtual screening, in both classification and regression tasks, has grown in popularity over the past few years. However, their effectiveness is strongly dependent on many different factors. RESULTS In this study, the influence of the way of forming the set of inactives on the classification process was examined: random and dive...